Towards local visual modeling for image captioning
Authors
Abstract
In this paper, we study local visual modeling with grid features for image captioning, which is critical for generating accurate and detailed captions. To achieve this target, we propose a Locality-Sensitive Transformer Network (LSTNet) with two novel designs, namely Locality-Sensitive Attention (LSA) and Locality-Sensitive Fusion (LSF). LSA is deployed for intra-layer interaction by modeling the relationship between each grid feature and its neighbors, which reduces the difficulty of local object recognition during captioning. LSF is used for inter-layer information fusion; it aggregates the information of different encoder layers for cross-layer semantical complementarity. With these two novel designs, LSTNet can model the local visual information of grid features to improve captioning quality. To validate LSTNet, we conduct extensive experiments on the competitive MS-COCO benchmark. The experimental results show that LSTNet is not only capable of local visual modeling, but also outperforms a bunch of state-of-the-art captioning models in offline and online testing, i.e., 134.8 CIDEr and 136.3 CIDEr, respectively. Besides, the generalization of LSTNet is also verified on the Flickr8k and Flickr30k datasets. The source code is available on GitHub: https://www.github.com/xmu-xiaoma666/LSTNet.
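The abstract only describes LSA and LSF at a high level. The sketch below illustrates the two underlying ideas in PyTorch: restricting each grid feature's attention to its spatial neighborhood, and fusing the outputs of several encoder layers. The module names (LocalGridAttention, CrossLayerFusion), the 3x3 neighborhood, the unfold-based gathering, and the concatenate-and-project fusion are assumptions made for illustration; they are not taken from the paper or its official repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalGridAttention(nn.Module):
    """Each grid cell attends only to its k x k spatial neighborhood
    (an assumed stand-in for the locality-sensitive attention idea)."""

    def __init__(self, dim, neighborhood=3):
        super().__init__()
        self.k = neighborhood                      # assumed 3x3 neighborhood
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, grid, height, width):
        # grid: (B, H*W, C) flattened grid features from a visual backbone
        b, n, c = grid.shape
        q = self.q_proj(grid)                                   # (B, N, C)
        kv = self.kv_proj(grid).view(b, n, 2, c)
        k, v = kv[..., 0, :], kv[..., 1, :]                     # (B, N, C) each

        def neighbors(x):
            # Gather the k x k neighborhood of every grid cell via unfold.
            x = x.transpose(1, 2).reshape(b, c, height, width)  # (B, C, H, W)
            x = F.unfold(x, kernel_size=self.k, padding=self.k // 2)
            return x.view(b, c, self.k * self.k, n).permute(0, 3, 2, 1)

        k_local, v_local = neighbors(k), neighbors(v)           # (B, N, k*k, C)

        # Each query attends only over its own local neighborhood.
        attn = torch.einsum("bnc,bnmc->bnm", q, k_local) * self.scale
        attn = attn.softmax(dim=-1)
        return torch.einsum("bnm,bnmc->bnc", attn, v_local)     # (B, N, C)


class CrossLayerFusion(nn.Module):
    """Fuses the outputs of several encoder layers into one representation
    (a simple concatenate-and-project stand-in for inter-layer fusion)."""

    def __init__(self, dim, num_layers):
        super().__init__()
        self.proj = nn.Linear(num_layers * dim, dim)

    def forward(self, layer_outputs):
        # layer_outputs: list of (B, N, C) tensors, one per encoder layer
        return self.proj(torch.cat(layer_outputs, dim=-1))


if __name__ == "__main__":
    feats = torch.randn(2, 7 * 7, 512)                 # e.g. a 7x7 feature grid
    lsa = LocalGridAttention(dim=512, neighborhood=3)
    out = lsa(feats, height=7, width=7)                # (2, 49, 512)
    fuse = CrossLayerFusion(dim=512, num_layers=2)
    print(fuse([feats, out]).shape)                    # torch.Size([2, 49, 512])
```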
Similar Resources
Image Captioning using Visual Attention
This project aims at generating captions for images using neural language models. There has been a substantial increase in the number of proposed models for the image captioning task since neural language models and convolutional neural networks (CNNs) became popular. Our project builds on one such work, which uses a variant of recurrent neural network coupled with a CNN. We intend to enhance t...
Oracle Performance for Visual Captioning
The task of associating images and videos with a natural language description has attracted a great amount of attention recently. The state-of-the-art results on some of the standard datasets have been pushed into the regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates performances that an oracle can obtain...
Contrastive Learning for Image Captioning
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learn...
Stack-Captioning: Coarse-to-Fine Learning for Image Captioning
The existing image captioning approaches typically train a one-stage sentence decoder, which makes it difficult to generate rich, fine-grained descriptions. On the other hand, multi-stage image captioning models are hard to train due to the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders each of which...
Social Image Captioning: Exploring Visual Attention and User Attention
Image captioning with a natural language has been an emerging trend. However, the social image, associated with a set of user-contributed tags, has been rarely investigated for a similar task. The user-contributed tags, which could reflect the user attention, have been neglected in conventional image captioning. Most existing image captioning models cannot be applied directly to social image ca...
Journal
Journal title: Pattern Recognition
Year: 2023
ISSN: 1873-5142, 0031-3203
DOI: https://doi.org/10.1016/j.patcog.2023.109420